Project Description¶
Data cleaning is a fundamental skill for data engineers. It involves reading, modifying, splitting, and saving data in a clean and organized way.
In this project, we will apply our data cleaning skills to work with information about marketing campaigns conducted by a bank. We will clean and prepare the data so it can be used effectively for analysis and storage.
Personal loans are an important source of income for banks. For example, in the United Kingdom, the typical interest rate for a two-year personal loan is about 10%. While this rate might seem moderate, UK consumers borrowed around £1.5 billion in September 2022, which could generate approximately £300 million in interest for banks over two years.
Our task is to help a bank clean the data they collected from a recent marketing campaign aimed at encouraging customers to take out personal loans. The bank plans to run more campaigns in the future, so they want the data to be structured and formatted in a specific way. This will allow them to load the cleaned data easily into a PostgreSQL database and import future campaign data smoothly.
We have been given a CSV file called "bank_marketing.csv"
. Our job is to clean, reformat, and split the data into three separate CSV files. Below are the details of these files along with their columns and cleaning requirements.
client.csv
¶
Column | Data Type | Description | Cleaning Requirements |
---|---|---|---|
client_id |
integer | Client ID | No changes needed |
age |
integer | Client's age in years | No changes needed |
job |
object | Client's type of job | Replace "." with "_" |
marital |
object | Client's marital status | No changes needed |
education |
object | Client's level of education | Replace "." with "_" and "unknown" with np.NaN |
credit_default |
bool | Whether the client’s credit is in default | Convert to boolean: 1 if "yes" , otherwise 0 |
mortgage |
bool | Whether the client has a mortgage | Convert to boolean: 1 if "yes" , otherwise 0 |
campaign.csv
¶
Column | Data Type | Description | Cleaning Requirements |
---|---|---|---|
client_id |
integer | Client ID | No changes needed |
number_contacts |
integer | Number of contact attempts in current campaign | No changes needed |
contact_duration |
integer | Duration of last contact in seconds | No changes needed |
previous_campaign_contacts |
integer | Number of contact attempts in previous campaign | No changes needed |
previous_outcome |
bool | Outcome of previous campaign | Convert to boolean: 1 if "success" , otherwise 0 |
campaign_outcome |
bool | Outcome of current campaign | Convert to boolean: 1 if "yes" , otherwise 0 |
last_contact_date |
datetime | Last date client was contacted | Create from combining day , month , and a new year column with value 2022 Format as "YYYY-MM-DD" |
economics.csv
¶
Column | Data Type | Description | Cleaning Requirements |
---|---|---|---|
client_id |
integer | Client ID | No changes needed |
cons_price_idx |
float | Consumer price index (monthly indicator) | No changes needed |
euribor_three_months |
float | Euro Interbank Offered Rate three-month rate (daily indicator) | No changes needed |
By completing this project, we will strengthen our ability to clean and prepare real-world data for further analysis and database use. This will help ensure that the bank’s marketing data is accurate, consistent, and ready for future use.
import pandas as pd
import numpy as np
df = pd.read_csv('bank_marketing.csv', usecols=lambda column: column != "index")
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 41188 entries, 0 to 41187 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 client_id 41188 non-null int64 1 age 41188 non-null int64 2 job 41188 non-null object 3 marital 41188 non-null object 4 education 41188 non-null object 5 credit_default 41188 non-null object 6 mortgage 41188 non-null object 7 month 41188 non-null object 8 day 41188 non-null int64 9 contact_duration 41188 non-null int64 10 number_contacts 41188 non-null int64 11 previous_campaign_contacts 41188 non-null int64 12 previous_outcome 41188 non-null object 13 cons_price_idx 41188 non-null float64 14 euribor_three_months 41188 non-null float64 15 campaign_outcome 41188 non-null object dtypes: float64(2), int64(6), object(8) memory usage: 5.0+ MB
We will split and tidy bank_marketing.csv, storing as three DataFrames called client, campaign, and economics, each containing the columns outlined in the notebook and formatted to the data types listed.¶
# Split into the three tables
client = df[["client_id", "age", "job", "marital",
"education", "credit_default", "mortgage"]]
campaign = df[["client_id", "number_contacts", "month", "day",
"contact_duration", "previous_campaign_contacts", "previous_outcome", "campaign_outcome"]]
economics = df[["client_id", "cons_price_idx", "euribor_three_months"]]
Beginng with client data¶
client
client_id | age | job | marital | education | credit_default | mortgage | |
---|---|---|---|---|---|---|---|
0 | 0 | 56 | housemaid | married | basic.4y | no | no |
1 | 1 | 57 | services | married | high.school | unknown | no |
2 | 2 | 37 | services | married | high.school | no | yes |
3 | 3 | 40 | admin. | married | basic.6y | no | no |
4 | 4 | 56 | services | married | high.school | no | no |
... | ... | ... | ... | ... | ... | ... | ... |
41183 | 41183 | 73 | retired | married | professional.course | no | yes |
41184 | 41184 | 46 | blue-collar | married | professional.course | no | no |
41185 | 41185 | 56 | retired | married | university.degree | no | yes |
41186 | 41186 | 44 | technician | married | professional.course | no | no |
41187 | 41187 | 74 | retired | married | professional.course | no | yes |
41188 rows × 7 columns
client_columns = ['client_id', 'age', 'job', 'marital', 'education', 'credit_default', 'mortgage']
for col in client_columns:
print(f"Value counts for column: {col}\n")
print(client[col].value_counts()) # Correct way to access column
print("\n" + "-"*50 + "\n") # Separator for readability
Value counts for column: client_id client_id 0 1 27512 1 27454 1 27455 1 27456 1 .. 13730 1 13731 1 13732 1 13733 1 41187 1 Name: count, Length: 41188, dtype: int64 -------------------------------------------------- Value counts for column: age age 31 1947 32 1846 33 1833 36 1780 35 1759 ... 89 2 91 2 94 1 87 1 95 1 Name: count, Length: 78, dtype: int64 -------------------------------------------------- Value counts for column: job job admin. 10422 blue-collar 9254 technician 6743 services 3969 management 2924 retired 1720 entrepreneur 1456 self-employed 1421 housemaid 1060 unemployed 1014 student 875 unknown 330 Name: count, dtype: int64 -------------------------------------------------- Value counts for column: marital marital married 24928 single 11568 divorced 4612 unknown 80 Name: count, dtype: int64 -------------------------------------------------- Value counts for column: education education university.degree 12168 high.school 9515 basic.9y 6045 professional.course 5243 basic.4y 4176 basic.6y 2292 unknown 1731 illiterate 18 Name: count, dtype: int64 -------------------------------------------------- Value counts for column: credit_default credit_default no 32588 unknown 8597 yes 3 Name: count, dtype: int64 -------------------------------------------------- Value counts for column: mortgage mortgage yes 21576 no 18622 unknown 990 Name: count, dtype: int64 --------------------------------------------------
# Cleaning education column
client = client.copy() # This removes ambiguity
client["education"] = client["education"].str.replace(".", "_", regex=False)
client["education"] = client["education"].replace("unknown", np.NaN)
# Cleaning job column
client["job"] = client["job"].str.replace(".", "_")
# Clean and convert client columns to bool data type
for col in ["credit_default", "mortgage"]:
client[col] = client[col].map({"yes": 1,
"no": 0,
"unknown": 0})
client[col] = client[col].astype(bool)
Since tha client data has been cleaned, we will move on to Campaign data¶
campaign
client_id | number_contacts | month | day | contact_duration | previous_campaign_contacts | previous_outcome | campaign_outcome | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | may | 13 | 261 | 0 | NaN | NaN |
1 | 1 | 1 | may | 19 | 149 | 0 | NaN | NaN |
2 | 2 | 1 | may | 23 | 226 | 0 | NaN | NaN |
3 | 3 | 1 | may | 27 | 151 | 0 | NaN | NaN |
4 | 4 | 1 | may | 3 | 307 | 0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
41183 | 41183 | 1 | nov | 30 | 334 | 0 | NaN | NaN |
41184 | 41184 | 1 | nov | 6 | 383 | 0 | NaN | NaN |
41185 | 41185 | 2 | nov | 24 | 189 | 0 | NaN | NaN |
41186 | 41186 | 1 | nov | 17 | 442 | 0 | NaN | NaN |
41187 | 41187 | 3 | nov | 23 | 239 | 1 | NaN | NaN |
41188 rows × 8 columns
# create a campaign copy to avoid ambiguity
campaign = campaign.copy()
# Change campaign_outcome to binary values
campaign["campaign_outcome"] = campaign["campaign_outcome"].map({"yes": 1, "no": 0})
# Convert previous_outcome to binary values
campaign["previous_outcome"] = campaign["previous_outcome"].map({
"success": 1,
"failure": 0,
"nonexistent": 0
})
# Clean and convert outcome columns to bool
for col in ["campaign_outcome", "previous_outcome"]:
campaign[col] = campaign[col].astype(bool)
# Adding column 'year' in campaign data
campaign['year']= '2022'
# check the type of month and day
campaign[['day', 'month']].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 41188 entries, 0 to 41187 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 day 41188 non-null int64 1 month 41188 non-null object dtypes: int64(1), object(1) memory usage: 643.7+ KB
# we will change the type of 'day' from int to str
campaign['day'] = campaign['day'].astype(str)
# Add last_contact_date column
campaign["last_contact_date"] = campaign["year"] + "-" + campaign["month"] + "-" + campaign["day"]
# Convert to datetime
campaign["last_contact_date"] = pd.to_datetime(campaign["last_contact_date"],
format="%Y-%b-%d")
# Drop unneccessary columns
campaign.drop(columns=["month", "day", "year"], inplace=True)
Now our data is cleaned, we splitted the data into three datframes. We can now save our resultant files individually.¶
# Save tables to individual csv files
client.to_csv("client.csv", index=False)
campaign.to_csv("campaign.csv", index=False)
economics.to_csv("economics.csv", index=False)